Quantum search algorithms are considered in the context of protein sequence comparison in
biocomputing. Given a sample protein sequence of length $m$ (i.e. $m$ residues), the problem considered
is to find an optimal match in a large database containing $N$ residues. Initially, Grover's quantum
search algorithm is applied to a simple illustrative case, namely where the database forms a complete
set of states over the $2^m$ basis states of an $m$-qubit register, and thus is known to contain the exact
sequence of interest. This example demonstrates explicitly the typical $O(\sqrt{N})$ speedup on the
classical $O(N)$ requirements. An algorithm is then presented for the (more realistic) case where
the database may contain repeat sequences, and may not necessarily contain an exact match to the
sample sequence. In terms of minimizing the Hamming distance between the sample sequence and
the database subsequences, the algorithm finds an optimal alignment in $O(\sqrt{N})$ steps, by employing
an extension of Grover's algorithm, due to Boyer, Brassard, Høyer and Tapp, for the case when the
number of matches is not a priori known.

The fantastic possibilities of quantum parallelism in
computing, suggested by the convergence of quantum
mechanics and information theory in the past two decades,
are fast being enumerated in the guise of quantum
algorithms. First and foremost among these is the factoring
algorithm of Shor [1], which provided great impetus to
the field of quantum computing. Shor's algorithm applied
to a given number $N$ requires $O((\log N)^3)$ steps,
and represents an exponential speed-up over the best
known classical algorithms. Another important result, due to
Grover [2], was the discovery of a quantum search
algorithm for finding a particular element in an unordered
set of $N$ elements in only $O(\sqrt{N})$ steps, a significant
improvement over the classical cost $O(N)$.

In this paper the application of quantum search algo- 
rithms to an important problem at the heart of biocom- 
puting (or bioinformatics), that of protein sequence com- 
parison and alignment, is considered. As the mapping 
and sequencing of the human genome (some $3 \times 10^9$ base
pairs) nears completion, the relatively new field of bio- 
computing has become obvious in its importance to the 
quantitative analysis of this vast amount of data. Some 
fundamental tasks in biocomputing involving sequence 
analysis include: searching databases in order to com- 
pare a new sub-sequence to existing sequences, inferring 
protein sequence from DNA sequence, and calculation of 
sequence alignment in the analysis of protein structure 
and function. A tremendous amount of computing is 
required, much of which is devoted to search-type prob- 
lems, either directly in large databases, or in configura- 
tion space of alignment possibilities. While it is possible 
that all of these problems may be amenable to quantum 
algorithmic speed-up, it is explicitly demonstrated in this 
work how the fundamental task of sequence alignment 
can be approached using a quantum computer. Indeed,
this problem is a very natural application of the quantum
search algorithm (perhaps a strange reflection of the
possibility that the machinery of DNA itself may actually
function using quantum search algorithms [3]).

In general terms Grover's search algorithm relies on 
the existence of a quantum computer Q operating us- 
ing an oracle function, F. The set of search possibili- 
ties is represented by states in the Hilbert space of Q. 
The oracle function simply tests whether a given state is 
the actual target state. Grover found a unitary opera- 
tor U (involving the oracle function test) which evolves 
the quantum computer in such a way that the amplitude 
of the target state in the wave function of Q is amplified.
Furthermore, Grover showed that there exists a
number $k < \sqrt{N}$ such that after $k$ applications of $U$
the probability of finding the target state is at least 1/2.
Subsequently, Boyer, Brassard, Høyer and Tapp (BBHT)
proved a tighter bound: one must iterate the algorithm
on average at least $(\sin\frac{\pi}{8})\sqrt{N}$ times to achieve a
probability of 1/2 for finding the target [4].

To begin the application of quantum search algorithms 
to protein sequence analysis, the problem of sequence 
alignment to a large database of sequence domains is con- 
sidered. That is, given a sample sequence the task is to 
find out the location in the database of an exact or clos- 
est match (with respect to some defined measure). Ap- 
plication of the Grover algorithm directly to this search 
task would cause trouble immediately because, by defini- 
tion, it is not known if the target exists in the database, 
or if it actually exists multiple times. If there are
actually $N_t$ solutions, the number of iterations required to
find a solution with probability 1/2 is $(\sin\frac{\pi}{8})\sqrt{N/N_t}$ [4].
Thus, if one does not know the number of solutions at the
outset, the computer may inadvertently be halted when
the amplitude of the target states is very small. This
happens because the process of amplitude amplification
is not monotonic, but rather oscillates with the number
of iterations. Fortunately, this impasse has been resolved
by BBHT, who provide an algorithm, using Grover's
algorithm as a subroutine, for finding a solution
in the case where the number of solutions is unknown [4].
This result allows for the application of quantum search
algorithms to the field of biocomputing.
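
The oscillation described above can be made concrete with the standard closed form for Grover's algorithm with $N_t$ solutions, $P_k = \sin^2((2k+1)\theta)$ with $\sin\theta = \sqrt{N_t/N}$. The following is a minimal Python sketch, not part of the original analysis; the database size and the function name are illustrative assumptions.

```python
import math

def success_probability(n_items: int, n_targets: int, k: int) -> float:
    """Probability of measuring a target after k Grover iterations.

    Standard closed form: sin^2((2k + 1) * theta),
    with sin(theta) = sqrt(n_targets / n_items).
    """
    theta = math.asin(math.sqrt(n_targets / n_items))
    return math.sin((2 * k + 1) * theta) ** 2

N = 1 << 16  # illustrative database size (assumption)
# Near-optimal stopping point if exactly one target is assumed.
k_single = round(math.pi / 4 * math.sqrt(N))

for n_targets in (1, 4, 16):
    p = success_probability(N, n_targets, k_single)
    print(f"N_t={n_targets:2d}  k={k_single}  P(success)={p:.5f}")
```

Stopping at the iteration count tuned for a single target gives a success probability close to one when $N_t = 1$, but close to zero when the database actually holds four matches, which is the failure mode the BBHT extension is designed to avoid.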

In terms of protein sequences, the human genome is 
composed of about 150,000 domains, each containing on 
average 300 residues (amino acids). An interesting fea- 
ture of approaching the sequence analysis problem using 
a quantum computer is that the entire database could 
in principle be stored in a single wave function super- 
position, and then be presented simultaneously for in- 
spection. To illustrate the basic idea, a very simple 
case of sequence comparison is initially considered, fol- 
lowed by a more realistic problem later. Consider a 
database, D, constructed from the domains of the human 
genome placed end-to-end, so that a continuous list of N 
residues, $D = \{R_0, R_1, \ldots, R_{N-1}\}$, is created.
Independently, a sample sequence $s = \{r_0, r_1, \ldots, r_{m-1}\}$
composed of $m$ residues is given; the task is to compare this with
the database. Each residue is labeled by a letter of the
20-letter amino acid alphabet, so 5 bits per residue are
needed to encode the database. Thus, the residues
$R_i$ and $r_i$ are represented by the bit strings
$\prod_{\alpha=0}^{4} B_{i\alpha}$ and $\prod_{\alpha=0}^{4} b_{i\alpha}$, respectively.
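
As a classical illustration of this encoding, the sketch below packs residues into 5-bit codes and enumerates the $N - m + 1$ consecutive sub-sequences. The particular letter-to-code table, the function names and the toy database are assumptions made for illustration; the text does not fix a specific code assignment.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes
BITS_PER_RESIDUE = 5                  # 2**5 = 32 >= 20

# Hypothetical code table: residue letter -> index in AMINO_ACIDS.
CODE = {aa: idx for idx, aa in enumerate(AMINO_ACIDS)}

def encode(sequence: str) -> int:
    """Pack a residue string into one integer, 5 bits per residue."""
    value = 0
    for residue in sequence:
        value = (value << BITS_PER_RESIDUE) | CODE[residue]
    return value

def subsequences(database: str, m: int) -> list[int]:
    """Encode the N - m + 1 consecutive length-m windows of the database."""
    return [encode(database[i:i + m]) for i in range(len(database) - m + 1)]

database = "MKVLAAGIVGLK"  # toy database with N = 12 (made-up sequence)
windows = subsequences(database, m=3)
print(len(windows), format(windows[0], "015b"))
```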

The quantum computer to analyze this system is composed
of two registers, with numbers of qubits $Q_1$ and $Q_2$,
respectively. The bit-wise representation of the protein
sequences will be encoded into the qubits of this system.
Leaving issues of data transfer aside, the entire database
is represented by a quantum superposition over the two
registers:

$$|\Psi_D\rangle = \frac{1}{\sqrt{N-m+1}} \sum_{i=0}^{N-m} |\phi_i\rangle \otimes |i\rangle, \tag{1}$$

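where all the consecutive sub-sequences in the database
of length $m$ are encoded in the first register, with $Q_1 = 5m$,
as

$$|\phi_i\rangle = \prod_{\alpha=i}^{i+m-1} \prod_{\beta=0}^{4} |B_{\alpha\beta}\rangle \equiv \prod_{\alpha=0}^{5m-1} |\phi_{i\alpha}\rangle. \tag{2}$$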

That is, from the database of length $N$ residues,
$N - m + 1$ sub-sequences of length $m$ are constructed
by moving along from the first position (allowing domain
crossing). Position information of the sub-sequences is
meanwhile tagged explicitly by binary numbers, $|i\rangle$, in
the second register, and is accessed by an operator, $\hat{X}$,
acting in the Hilbert space of the second register, which
gives the position as $\hat{X}|i\rangle = i|i\rangle$ ($0 \le i \le N - m$). In
order that this register can encode all positions, $Q_2$ must
satisfy $2^{Q_2} > N - m$. The number of qubits required
in this register is relatively small: taking the database
size to be the number of residues in the human
genome implies that $Q_2 = 26$ suffices. In the first register,
typical sequence comparison problems require $m \sim O(300)$.
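
The register sizes quoted above can be checked with a few lines of arithmetic. This is a sketch only: the function name is an assumption, and the database size is taken as 150,000 domains of 300 residues each, as stated earlier in the text.

```python
def register_sizes(n_residues: int, m: int, bits_per_residue: int = 5) -> tuple[int, int]:
    """Return (Q1, Q2): qubits for the sub-sequence and position registers."""
    q1 = bits_per_residue * m
    # Positions run over 0..N-m, so Q2 is the smallest integer with 2**Q2 > N - m.
    q2 = max(1, (n_residues - m).bit_length())
    return q1, q2

n_residues = 150_000 * 300  # about 4.5e7 residues
q1, q2 = register_sizes(n_residues, m=300)
print(q1, q2)
```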

The next step in the initialization process is the coding
of a table, $T[0 \ldots N-m]$, into the quantum state, which
measures the difference between the database states $|\phi_i\rangle$
and the sample sequence state in terms of the total number
of bit flips required to transform any database state
into the sample sequence. In other words, $T[0 \ldots N-m]$
is the set of Hamming distances. Remarkably, the set
of Hamming distances for the entire database can be
created by simply acting on each qubit of the first register with
a CNOT operation with respect to the sample sequence
state:

$$|\Psi_H\rangle = U_{\rm CNOT}(s)\,|\Psi_D\rangle = \frac{1}{\sqrt{N-m+1}} \sum_{i=0}^{N-m} |\bar{\phi}_i\rangle \otimes |i\rangle. \tag{3}$$

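Denoting the individual qubits of the "Hamming states"
$|\bar{\phi}_i\rangle$ by

$$|\bar{\phi}_i\rangle = \prod_{\alpha=0}^{5m-1} |q_{i\alpha}\rangle, \tag{4}$$

an operator, $\hat{T}$, is introduced which, acting on a state
$|\bar{\phi}_i\rangle$, gives the Hamming distance table value $T[i]$ as:

$$\hat{T}: \quad \hat{T}|\bar{\phi}_i\rangle = T[i]\,|\bar{\phi}_i\rangle, \qquad T[i] = \sum_{\alpha=0}^{5m-1} q_{i\alpha}. \tag{5}$$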
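
Classically, the CNOT step and the table operator correspond to a bitwise XOR followed by a count of the set bits. The sketch below illustrates this on toy 4-bit windows; the values and the function name are assumptions chosen for readability, not real sequence data.

```python
def hamming_table(windows: list[int], sample: int) -> list[int]:
    """T[i] = number of bits in which window i differs from the sample."""
    # XOR plays the role of the CNOT step: matching bits become 0,
    # so an exact match maps to the all-zero string.
    return [bin(w ^ sample).count("1") for w in windows]

windows = [0b1010, 0b0110, 0b1011, 0b0101]  # toy encoded sub-sequences
sample = 0b1010
T = hamming_table(windows, sample)
print(T)
```

An exact match appears as a table value of zero, which is the target condition used in the search that follows.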

With the computer design completed and initialized, a
simple search problem can be defined in order to
demonstrate how the computer works. First, the database is
taken to be of length $N = 2^m + m - 1$, so that there are
exactly $2^m$ states in the superposition, and it is furthermore
demanded that all these states are distinct. The problem
is to search the database for the sub-sequence $s$, which
occurs exactly once, but at an unknown location.
Classically, this would require $O(N)$ steps. However, by using
Grover's search algorithm, the match can be found in
$O(\sqrt{N})$ steps. In this example, the database
decomposition has been artificially arranged to be over a
complete set of states of the first register, which means that
Grover's search algorithm can be applied directly.

The problem defined by Grover has been modified
slightly, but the applicability of the search
algorithm remains. The original problem was defined in
terms of an oracle function, $F(x)$, over a set of
values $x \in \{0, \ldots, N-1\}$, which is zero everywhere
except at some value $t$, the target of the search, where
$F(t) = 1$. The sequence comparison problem here has
been re-structured so that a value of $x$ represents a
sub-sequence of the database, and the oracle function is just a
direct comparison with the sample sequence. In a sense,
the black box nature of the oracle function has been
simplified, at the cost of increasing the complexity of the
initial wave function with position information. It remains
to be seen whether this is a feasible way of coding a
sequence database. Of course, an alternative is to sweep
all details of the database look-up and comparison into
the oracle function. The difference is subtle, and
perhaps non-trivial in practice. The advantage of the latter
approach might be in the initialization of the quantum
computer state. The algorithms presented here would
still apply in this case.

In the computer design defined here, Grover's search
algorithm is applied to the first register containing the
sub-sequence state superposition. The problem is to find
the state $|S\rangle = U_{\rm CNOT}|s\rangle = |0 \ldots 0\rangle$ (zeros in all $m$ qubits
of the first register) with table value $T[i_s] = 0$, occurring
at position $i_s$ (as yet unknown). Once the state is found,
the location of the sequence in the database can be
determined by making a measurement of $\hat{X}$ on the second
register.

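To illustrate the working of the algorithm, the geometrical
picture [5,6,7], which is particularly transparent, is
applied to this framework. The search algorithm is
initiated by decomposing the state $|\Psi_H\rangle$ into orthogonal
components with respect to $|S\rangle$ as

$$|\Psi_H\rangle = \frac{1}{\sqrt{N-m+1}}\,|S\rangle \otimes |i_s\rangle + \sqrt{\frac{N-m}{N-m+1}}\,|R\rangle, \tag{6}$$

$$|R\rangle = \frac{1}{\sqrt{N-m}} \sum_{i=0,\, i \neq i_s}^{N-m} |\bar{\phi}_i\rangle \otimes |i\rangle. \tag{7}$$
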
The evolution of the quantum computer representing
the search algorithm occurs in the first register, the
second register lying dormant, yet through quantum
entanglement carrying the position information required at the
end. The operator $U$ is constructed from reflection
operators in the Hilbert space of the first register,

$$I_S = 1 - 2|S\rangle\langle S|, \qquad I_H = 1 - 2|\Psi_H\rangle\langle\Psi_H|. \tag{8}$$



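The operator $I_S$ contains the query to the oracle
function, $F(i)$, and acts on the Hamming states $|\bar{\phi}_i\rangle$ with a
phase shift dependent on the search criterion $T[i_s] = 0$:

$$I_S|\bar{\phi}_i\rangle = (-1)^{F(i)}|\bar{\phi}_i\rangle = \begin{cases} -|\bar{\phi}_i\rangle & \text{if } T[i] = 0, \\ \phantom{-}|\bar{\phi}_i\rangle & \text{otherwise.} \end{cases} \tag{9}$$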

In terms of these reflection operators, the unitary
operator evolving the system through one step of the search
algorithm is given by $U = -I_H I_S$. The evolution of the
computer proceeds through application of the operator
$U$ a number of times on the initial state, $|\Psi_H\rangle$. The
effect of this evolution is to amplify the component of the
target state, $|S\rangle$, in the superposition. It is important to
understand the nature of this process in order to
appreciate how the quantum computer functions. To see this
point it is convenient to express $U$ in the representation
of the subspace $\{|S\rangle, |R\rangle\}$:

$$U = \begin{pmatrix} \frac{N-m-1}{N-m+1} & \frac{2\sqrt{N-m}}{N-m+1} \\ -\frac{2\sqrt{N-m}}{N-m+1} & \frac{N-m-1}{N-m+1} \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}, \tag{10}$$

where $\sin\theta = 2\sqrt{N-m}/(N-m+1)$.
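
The rotation form of $U$ can be checked numerically by building the two reflections in the two-dimensional subspace and multiplying them. The sketch below assumes the initial state has components $1/\sqrt{N-m+1}$ along $|S\rangle$ and $\sqrt{(N-m)/(N-m+1)}$ along $|R\rangle$; the function name and the value chosen for $N-m+1$ are illustrative.

```python
import math

def grover_step_matrix(n_states: int) -> list[list[float]]:
    """2x2 matrix of U = -I_H I_S in the {|S>, |R>} basis, n_states = N - m + 1."""
    a = 1 / math.sqrt(n_states)               # component of |Psi_H> along |S>
    b = math.sqrt((n_states - 1) / n_states)  # component of |Psi_H> along |R>
    i_s = [[-1.0, 0.0], [0.0, 1.0]]           # I_S = 1 - 2|S><S|
    i_h = [[1 - 2 * a * a, -2 * a * b],       # I_H = 1 - 2|Psi_H><Psi_H|
           [-2 * a * b, 1 - 2 * b * b]]
    # Explicit 2x2 product, with the overall minus sign of U = -I_H I_S.
    return [[-sum(i_h[r][k] * i_s[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

n_states = 1024  # illustrative value of N - m + 1
U = grover_step_matrix(n_states)
cos_theta = (n_states - 2) / n_states
sin_theta = 2 * math.sqrt(n_states - 1) / n_states
print(U, cos_theta, sin_theta)
```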

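After $k$ steps of the algorithm the state of the computer
is given by

$$|\Psi_H^{(k)}\rangle = U^k |\Psi_H\rangle = \sum_{i=0}^{N-m} c_i^{(k)}\, |\bar{\phi}_i\rangle \otimes |i\rangle. \tag{11}$$

The amplitude of the target state, $c_{i_s}^{(k)}$, can be easily
calculated using the matrix representation for $U$. One
obtains:

$$c_{i_s}^{(k)} = \cos(k\theta - \alpha), \qquad \cos\alpha = \frac{1}{\sqrt{N-m+1}}. \tag{12}$$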
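
The step from the matrix representation to the target amplitude can be sketched as follows, assuming the initial state has components $(\cos\alpha, \sin\alpha)$ in the $\{|S\rangle, |R\rangle\}$ basis, which follows from the normalisation $\cos\alpha = 1/\sqrt{N-m+1}$:

```latex
U^k = \begin{pmatrix} \cos k\theta & \sin k\theta \\ -\sin k\theta & \cos k\theta \end{pmatrix},
\qquad
U^k \begin{pmatrix} \cos\alpha \\ \sin\alpha \end{pmatrix}
= \begin{pmatrix} \cos k\theta\cos\alpha + \sin k\theta\sin\alpha \\ -\sin k\theta\cos\alpha + \cos k\theta\sin\alpha \end{pmatrix}
= \begin{pmatrix} \cos(k\theta - \alpha) \\ -\sin(k\theta - \alpha) \end{pmatrix}.
```

The target amplitude is therefore largest when $k\theta = \alpha$. For large $N$, $\theta \approx 2/\sqrt{N}$ and $\alpha \approx \pi/2$, which gives $k \approx \frac{\pi}{4}\sqrt{N}$.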



The component along $|S\rangle$ is amplified to near unity at
$k_{\max} \approx \frac{\pi}{4}\sqrt{N}$ (for $N \gg m$). A measurement of $\hat{T}$ on
the first register will give a result $T[i]$ with probability
$|c_i^{(k)}|^2$. If $T[i] = 0$ then the algorithm has succeeded, i.e.
the sample sequence has been found, and a subsequent
measurement of $\hat{X}$ in the second register will give the
position, $i_s$, of the sequence in the database. A crucial point
is that one has to be careful interpreting the number of
steps required to obtain a successful outcome: merely
increasing the number of steps beyond $k_{\max}$ does not
improve the chances of success because the amplification is
not monotonic. Indeed, the probability of success
actually decreases when $k_{\max}$ is exceeded. The search may
therefore have to be run several times; however, for large
$N$ the savings in computer time compared to a classical
computer are clear, even if the search is repeated several
times.
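
The non-monotonic behaviour can be seen by evaluating the target probability $\cos^2(k\theta - \alpha)$ at a few values of $k$. This is a minimal numerical sketch; the function name and the chosen number of states are assumptions for illustration.

```python
import math

def target_probability(n_states: int, k: int) -> float:
    """cos^2(k*theta - alpha) for n_states = N - m + 1 and one exact match."""
    alpha = math.acos(1 / math.sqrt(n_states))
    theta = math.asin(2 * math.sqrt(n_states - 1) / n_states)
    return math.cos(k * theta - alpha) ** 2

n_states = 2 ** 20
k_max = round(math.pi / 4 * math.sqrt(n_states))

# Probability before, at, and beyond the optimal iteration count.
for k in (0, k_max // 2, k_max, 2 * k_max):
    print(k, target_probability(n_states, k))
```

Running twice as many steps as $k_{\max}$ returns the target probability to nearly its initial, negligible value, so the iteration count has to be chosen rather than simply maximised.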

While the above example serves to display the poten- 
tial of quantum search algorithms in the context of se- 
quence matching to a large database, it does not con- 
tain an important concept in bioinformatics - optimal 
alignment. Generally, the sample sequence may not be 
contained exactly in the database, and so one is inter- 
ested in how close is the best match (or matches) with 
respect to a well defined distance measure. Often this 
measure involves editing of strings by insertion of gaps 
in order to minimize the distance; in practice this process 
is very complicated. In the first instance, the problem is 
extended to that of finding an optimal alignment with 
respect to the Hamming distance, without editing of se- 
quences (which can be incorporated at a later stage). 






Let us first define the problem using, as far as possible,
the same notation as previously. The database is
taken to be of size $N \gg m$, but the restriction that
the number of database sub-sequence states equals $2^m$ is
relaxed, and the possibility is allowed that the set of
sub-sequences may contain repeats and, more importantly,
may or may not contain the sample sequence. The
problem then is to find an optimal alignment of the sample
sequence to a sub-sequence in the database. An optimal
alignment here is defined in the sense of finding the
smallest Hamming distance $T[i]$ with respect to the
sample sequence state.

In terms of our quantum computer, the database state
in this case is also described by the state $|\Psi_D\rangle$. An
important point is that the state is still normalised by the
factor $1/\sqrt{N-m+1}$ because the repeats occur at
different locations, and thus each state in the product space
of the two registers is distinct. The introduction of the
position register $Q_2$ has ensured this. Using the CNOT
operation on $|\Psi_D\rangle$, the superposition $|\Psi_H\rangle$ of Hamming
states is once again obtained. The algorithm strategy is
to search for alignments of increasing Hamming distance.
At the start of each search it is not known how many
solutions exist, or if there exist matches at all, and so
Grover's algorithm cannot be used directly. However, we
now use the extension of Grover's algorithm due to Boyer,
Brassard, Høyer and Tapp, which performs a search with
an a priori unknown number of solutions $N_t$, and finds a
match (if it exists) in $O(\sqrt{N/N_t})$ steps [4]. During the
course of the algorithm the computer's evolution must be
tailored to accommodate the fact that the search is now
based on all the target states that satisfy $T[i] = n$, where
$n$ is some pre-defined Hamming distance determined by
the algorithm. In order to apply the search algorithm in
this case the operator is modified, $I_S \to I_S(n)$, such that:

$$I_S(n)|\bar{\phi}_i\rangle = \begin{cases} -|\bar{\phi}_i\rangle & \text{if } T[i] = n, \\ \phantom{-}|\bar{\phi}_i\rangle & \text{otherwise.} \end{cases} \tag{13}$$

At each iteration the BBHT algorithm is employed, 
with a repeat index r as a pre-determined measure of the 
search confidence level. 

The optimal alignment algorithm is as follows: 

1. 0th iteration: search for an occurrence of the state
with zero Hamming distance, $T[i] = 0$. If successful,
measure the position and exit; if unsuccessful after
$r$ repeats of the BBHT search algorithm, go to the
next iteration.

2. $n$th iteration: search for a state with $T[i] = n$ using
$U = -I_H I_S(n)$. If successful, locate the position and
exit; if unsuccessful after $r$ repeats of the BBHT
search algorithm, go to the next iteration by
setting $n \to n + 1$.

3. Upon exit at some iteration $n = k$, one optimal
alignment, $T[i_k] = k$, and its position $i_k$ have been
found.



The total number of steps required is $O(rk\sqrt{N})$,
discounting the effect of sequence repeats (which reduces
the required number of iterations). At more cost a
sub-loop may be introduced to search for the other optimally
aligned sequences. In practice, the number of iterations
required is $k \ll m$, as one would determine a maximum
Hamming distance on biological grounds, beyond which
searching for an aligned state is pointless.
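
The control flow of the optimal alignment algorithm can be sketched classically as below. This is not a quantum simulation: `emulated_search` is a hypothetical stand-in that returns what a successful BBHT run would return, the Hamming distance is counted per residue rather than per bit for brevity, and the sequences are made-up toy data.

```python
import random
from typing import Optional

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def emulated_search(table: list[int], n: int) -> Optional[int]:
    """Classical stand-in for one BBHT run: a random index with T[i] == n, or None."""
    hits = [i for i, t in enumerate(table) if t == n]
    return random.choice(hits) if hits else None

def optimal_alignment(database: str, sample: str, max_distance: int,
                      r: int = 3) -> Optional[tuple[int, int]]:
    """Return (position, distance) of a closest window within max_distance, else None."""
    m = len(sample)
    table = [hamming(database[i:i + m], sample)
             for i in range(len(database) - m + 1)]
    for n in range(max_distance + 1):  # iterations of increasing distance n
        for _ in range(r):             # r repeats; redundant here, kept to mirror the steps
            position = emulated_search(table, n)
            if position is not None:
                return position, n
    return None

result = optimal_alignment("ACDEFGHIKLMNPQ", "FGHLK", max_distance=3)
print(result)
```

The `max_distance` cut-off plays the role of the biologically motivated maximum Hamming distance mentioned above.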

While the focus of this paper has been on protein se- 
quence comparison, the framework can be easily trans- 
lated into that for nucleotide sequence comparison in 
DNA. In this case, representing the four-letter nucleotide
alphabet requires only two qubits per base.

Although only the algorithmic aspect of the application
of quantum computing to sequence analysis has been
dealt with here, an obvious point to raise is the feasibility
of building such a device. With the ever increasing
ability to manipulate systems at the quantum level there
has been great progress in the demonstration of quantum
computation at the two-qubit level. Quantum logic gates
were demonstrated using ion traps [8] in 1995, and two
years later in nuclear magnetic resonance (NMR) systems [9].
In 1998 the actual experimental realization of a quantum
computer solving Deutsch's problem was reported by
two groups using NMR [10,11]. This was closely followed
by NMR implementations of the quantum search
algorithm [12,13]. Of course, a realistic quantum computer
needs to be scaled up significantly from these two-qubit
configurations. Perhaps the most promising prospect for a
scalable quantum computer capable of running the
algorithms presented here is based on the solid state design of
Kane [14]. The creation of a superposition representing
the human genome database would be another
considerable challenge.

To conclude, in this work the application of quantum
search algorithms in the context of biocomputing
has been studied, at least at the rather simple level of 
sequence alignment with respect to the Hamming dis- 
tance. Actual alignment problems would include align- 
ment through editing of sequences - i.e. insertion of gaps. 
It is quite possible that this procedure can be achieved 
using a multi-qubit representation (which includes gap 
characters) within the quantum search algorithm process 
by suitable choice of qubit evolution operators. Work in 
this direction is in progress. 